This is the first notebook in this project.

This notebook details how I looked at my interest in aviation data.

Flight operation data were obtained from United States Department of Transportation.

Now that the data is imported, I want to state the scope, data limitations, and objectives:

  1. I was only interested in the three main airports serving New York City: La Guardia, Newark, and John F Kennedy. However, there was a lack of complete datasets originating from Newark Liberty Airport. Instead, I shifted my area of investigation to Los Angeles Intl Airport.
  2. The dataset is limited to domestic flights originating and terminating within the United States
  3. I am only interested in departure data; that is, flights leaving from the three aforementioned airports
  4. If there are any underserved states/destinations
  5. Any trends or observations from the data
  6. Whether I am able to create a supervised learning model that can predict flight delays
  7. Concept of a “hub captive”
# Filters for flights departing
LAX_outbound_flight_data <- raw_flight_data %>%
  filter(., ORIGIN_AIRPORT_ID %in% LAX_airport)

The following table summarises some of the most frequently flown routes originating from LAX in the first six months of 2019.

flight_destinations_by_airport <- LAX_outbound_flight_data %>%
  group_by(ORIGIN, DEST) %>%
  summarise(number_of_flights = n(),
            mean_distance = mean(DISTANCE)) %>%
  arrange(desc(number_of_flights))
datatable(flight_destinations_by_airport)

There is another way to look at it. We know that New York City is served by three airports and it is not the only major city to be served by more than one airport. If we think about it that way,

flight_destinations_by_city <- LAX_outbound_flight_data %>%
  group_by(ORIGIN, DEST_CITY_NAME) %>%
  summarise(number_of_flights = n(),
            mean_distance = round(mean(DISTANCE),2)) %>%
  arrange(desc(number_of_flights))
datatable(flight_destinations_by_city)

Or by what the Department of Transporation defines as a destination city market.

A random thought: Can destination city markets be represented as Thiessen polygons and the distance to the nearest airport be mapped/calculated?

flight_destinations_by_market <- LAX_outbound_flight_data %>%
  group_by(ORIGIN, DEST_CITY_MARKET_ID) %>%
  summarise(number_of_flights = n(),
            mean_distance = round(mean(DISTANCE),2)) %>%
  arrange(desc(number_of_flights)) %>%
  inner_join(.,city_market, by = c("DEST_CITY_MARKET_ID" = "Code"))
datatable(flight_destinations_by_market)

The distribution can be represented as such.

frequency_histogram <- ggplot(flight_destinations_by_market, aes(number_of_flights)) + 
  geom_histogram(binwidth = 200)
ggplotly(frequency_histogram)

Another question: Just because Los Angeles is on the West Coast, does it mean that it mainly serves West Coast and American Southwest destinations?

frequency_distance_scatterplot <- ggplot(flight_destinations_by_market, aes(x = mean_distance, y = number_of_flights)) + 
  geom_point()
ggplotly(frequency_distance_scatterplot)

Apparently not. But what if we look at the spatial distribution of destinations connected to Los Angeles at the state level?

It appears that states adjacent to California tend to receive more flights than those farther away. However there are some exceptions such as IL, GA, FL, NY, and HI. With the exception of HI, IL, NY, GA, and FL are home to the hubs of the three main US carriers. HI’s high degree of connectivity with LAX is due to its geographical isolation from any other state bar California. Hence, the four main US carriers (Alaska, American, United, and Delta) and Hawaiian use LAX as a gateway to Hawaii due to the high numbers of feeder flights from across the United States into each of the carriers’ hub terminals at LAX.

# Create map
# Aggregate flights by state
lax_destinations_state <- LAX_outbound_flight_data %>%
  group_by(DEST_STATE_ABR, MONTH) %>%
  summarise(monthly_sum_flights = n()) %>%
  group_by(DEST_STATE_ABR) %>%
  summarise(mean_monthly_flights = round(mean(monthly_sum_flights)))

# Super important: Even though imported shp behave like dfs, merging with a df using anything but merge() from sp returns a df NOT shp
lax_destinations_state_shp <- merge(US_state_map, lax_destinations_state, by.y = "DEST_STATE_ABR", by.x = "STUSPS", all.x = T)
lax_destination_state_map_layer <- tm_shape(lax_destinations_state_shp) + 
  tm_polygons(col = "mean_monthly_flights", border.col = NA, palette = "viridis") + 
  tm_text("STUSPS")
lax_destinations_state_leaflet <- tmap_leaflet(x = lax_destination_state_map_layer)
lax_destinations_state_leaflet
#lax_destinations_state_airline <- LAX_outbound_flight_data %>%
#  group_by(DEST_STATE_ABR, OP_CARRIER_AIRLINE_ID, MONTH) %>%
#  summarise(monthly_sum_flights = n()) %>%
#  group_by(DEST_STATE_ABR, OP_CARRIER_AIRLINE_ID) %>%
#  summarise(mean_monthly_flights = round(mean(monthly_sum_flights))) %>%
#  inner_join(., airline_ID, by = c("OP_CARRIER_AIRLINE_ID" = "Code"))

We are often told that competition increases choice and benefits the consumer. But what do these benefits really look like? Lower fare prices? Better flight timings? Improved punctuality? Better service on board?

On fare prices. It is probably safe to say that almost everyone on any given flight paid a different price for their ticket, barring any promotional offers. Airlines will not divulge that information anyway. So it is impossible for us to look at the financial dimension without working relying extensively on pre-processed and aggregated statistics supplied by the Department of Transportation.

Depending on how observant you are, flight timings can either be a curious phenomenon or something that makes instinctive sense, like of course it will be scheduled like this. But instead of looking at how spaced out throughout the day the flights are, we can look at how close they are to their competitors’.

Second, we can also look at how punctual flights are. It is okay for them to arrive early, but not late.

late_flights <- LAX_outbound_flight_data %>%
  filter(ARR_DEL15 > 0) %>%
  group_by(OP_CARRIER_AIRLINE_ID) %>%
  summarise(number_delayed_flights = n())
ontime_early_flights <- LAX_outbound_flight_data %>%
  filter(ARR_DEL15 == 0) %>%
  group_by(OP_CARRIER_AIRLINE_ID) %>%
  summarise(number_ontime_early_flights = n())
LAX_airlines_punctuality_summary <- inner_join(ontime_early_flights, late_flights, by = ("OP_CARRIER_AIRLINE_ID")) %>%
  mutate(pct_late = round(number_delayed_flights/(number_delayed_flights + number_ontime_early_flights)*100,1),
         pct_ontime_early = round(number_ontime_early_flights/(number_delayed_flights + number_ontime_early_flights)*100,1)) %>%
  arrange(desc(pct_late)) %>%
  inner_join(., airline_ID, by = c("OP_CARRIER_AIRLINE_ID" = "Code"))
datatable(LAX_airlines_punctuality_summary)

Surprised by the results? Further, we need to take into account that the Department of Transporation classifies a delayed flight as one that arrived more than 15 minutes after its scheduled arrival time.

To look at these two things, we will pick out Destination City Markets that are served by more than one airline from LAX, which ought to result in competitive behavior.

#Departure delays do not matter as long as the flights can still get there on time given the amount of padding 
#People make decisions and plans around the scheduled arrival time, so if it is late on getting out but on time getting in, it is fine
#How often flights are delayed among similarly competitive routes

## First count the number of airlines operating to each airport and not city market beause each airport is run differently
## For each airline serving that airport, what is the proportion of flights that arrive there late, regardless of departure from LAX 
## Grouping by airline and the number of competitors, find out how late on average an airline is

LAX_outbound_flight_data <- LAX_outbound_flight_data %>%
  group_by(DEST_AIRPORT_ID) %>%
  mutate(number_airlines_serving_route = n_distinct(OP_CARRIER_AIRLINE_ID)) %>%
  ungroup()
delay_by_competitiveness <- LAX_outbound_flight_data %>%
  group_by(DEST_AIRPORT_ID, OP_CARRIER_AIRLINE_ID) %>%
  mutate(number_flights = round(n())) %>%
  ungroup() %>%
  filter(ARR_DELAY > 0) %>%
  group_by(DEST_AIRPORT_ID, OP_CARRIER_AIRLINE_ID, number_airlines_serving_route) %>%
  summarise(mean_delay = round(mean(ARR_DELAY), 2),
            mean_number_delayed_flights = round(mean(n())),
            mean_prop_delayed_flights = round(mean_number_delayed_flights/mean(number_flights)*100, 2)) %>%
  group_by(OP_CARRIER_AIRLINE_ID, number_airlines_serving_route) %>%
  summarise(mean_delay_time = round(mean(mean_delay), 2),
            mean_prop_delayed_flights = round(mean(mean_prop_delayed_flights), 2)) %>%
  inner_join(., airline_ID, by = c("OP_CARRIER_AIRLINE_ID" = "Code")) %>%
  arrange(OP_CARRIER_AIRLINE_ID, desc(mean_delay_time))
delay_scatter <- ggplot(delay_by_competitiveness, aes(x = number_airlines_serving_route, y = mean_delay_time)) + 
  geom_point(aes(color = Description))
ggplotly(delay_scatter)
competitiveness_delay_regression <- lm(mean_delay_time ~ number_airlines_serving_route + Description + mean_prop_delayed_flights, data = delay_by_competitiveness)
model_fitted <- augment(competitiveness_delay_regression, type.predict = "response")
summary(model_fitted)
##  mean_delay_time      number_airlines_serving_route Description       
##  Min.   : 9.6700000   Min.   :1.00000000            Length:59         
##  1st Qu.:32.0300000   1st Qu.:2.50000000            Class :character  
##  Median :37.4700000   Median :4.00000000            Mode  :character  
##  Mean   :37.9415254   Mean   :4.16949153                              
##  3rd Qu.:44.0400000   3rd Qu.:6.00000000                              
##  Max.   :66.5000000   Max.   :8.00000000                              
##  mean_prop_delayed_flights    .fitted              .se.fit          
##  Min.   :11.7600000        Min.   :23.2310940   Min.   :2.87425181  
##  1st Qu.:28.0800000        1st Qu.:33.6831387   1st Qu.:3.00832584  
##  Median :31.9000000        Median :38.9192234   Median :3.21541215  
##  Mean   :32.5932203        Mean   :37.9415254   Mean   :3.47753304  
##  3rd Qu.:35.7850000        3rd Qu.:41.9935141   3rd Qu.:3.56670293  
##  Max.   :59.7900000        Max.   :56.0085853   Max.   :5.63083635  
##      .resid                  .hat                 .sigma          
##  Min.   :-19.04471522   Min.   :0.144234010   Min.   :6.97733914  
##  1st Qu.: -3.74132299   1st Qu.:0.158004287   1st Qu.:7.52609778  
##  Median : -0.45300532   Median :0.180505878   Median :7.61421630  
##  Mean   :  0.00000000   Mean   :0.220338983   Mean   :7.56074615  
##  3rd Qu.:  5.21865305   3rd Qu.:0.222112283   3rd Qu.:7.64536277  
##  Max.   : 11.45392292   Max.   :0.553558932   Max.   :7.65179647  
##     .cooksd                .std.resid            
##  Min.   :0.00000064243   Min.   :-2.78421764843  
##  1st Qu.:0.00131591516   1st Qu.:-0.54025207704  
##  Median :0.00794340822   Median :-0.06508904839  
##  Mean   :0.03685270637   Mean   :-0.00240660489  
##  3rd Qu.:0.03326151478   3rd Qu.: 0.85297957760  
##  Max.   :0.36728210364   Max.   : 2.03437962066
model_plot <- ggplot(delay_by_competitiveness, aes(number_airlines_serving_route, mean_delay_time, color = Description)) + 
  geom_point() + 
  geom_line(data = model_fitted, aes(y = .fitted))
ggplotly(model_plot)